fix(webapp): auto-recover replication services after stream errors #3613

ericallam wants to merge 4 commits into
Conversation
Walkthrough

This PR adds configurable error recovery for the runs and sessions replication services. When a logical replication stream fails (e.g., during a database failover), the system can reconnect with exponential backoff, exit to let an external supervisor restart the host, or remain stopped with logging. Environment variables control per-service strategy selection and tuning. The implementation integrates into both services' lifecycle (on error, stream start, and shutdown) and is validated through containerized integration tests that force replication stream failures.
force-pushed from 6f8cc24 to 5ba46ff
force-pushed from bc57072 to 969dbdb
force-pushed from 969dbdb to 964b7e4
When the underlying logical-replication client errored (e.g. after a Postgres failover), the runs and sessions replication services logged the error and left the stream stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind.

Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own:

- "reconnect" (default) re-subscribes via the existing subscribe(lastLsn) path with exponential backoff (1s -> 60s cap, unlimited attempts; see the sketch below), which re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN.
- "exit" calls process.exit after a short flush window so a host's supervisor (Docker restart=always, systemd, k8s, etc.) can replace the process.
- "log" preserves the historical behaviour.

Per-service strategy + exit knobs are env-driven via RUN_REPLICATION_ERROR_STRATEGY / SESSION_REPLICATION_ERROR_STRATEGY plus matching *_EXIT_DELAY_MS / *_EXIT_CODE. Reconnect tuning is shared across both services via REPLICATION_RECONNECT_INITIAL_DELAY_MS / _MAX_DELAY_MS / _MAX_ATTEMPTS (0 = unlimited).
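A minimal sketch of the backoff schedule described above, under the stated defaults (the `ReconnectOptions` shape and the `nextDelayMs` helper are illustrative, not the PR's actual identifiers):

```ts
// Illustrative only; the real implementation lives in the webapp's
// replication error-recovery module and may differ in shape.
type ReconnectOptions = {
  initialDelayMs: number; // default 1_000 (1s)
  maxDelayMs: number; // default 60_000 (60s cap)
  maxAttempts: number; // 0 = unlimited
};

// Returns the delay before the next reconnect attempt, or null to give up.
function nextDelayMs(attempt: number, opts: ReconnectOptions): number | null {
  if (opts.maxAttempts !== 0 && attempt >= opts.maxAttempts) {
    return null;
  }
  // Exponential backoff: 1s, 2s, 4s, ... capped at maxDelayMs.
  return Math.min(opts.initialDelayMs * 2 ** attempt, opts.maxDelayMs);
}
```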
Addresses PR review feedback:
- LogicalReplicationClient.subscribe() can throw before its internal
"error" listener is wired up (notably when pg client.connect() fails
mid-failover). The reconnect strategy's catch block only logged, so
recovery silently stopped. Now also calls scheduleReconnect(err) — the
pendingReconnect guard makes it idempotent if an error event was also
emitted.
- Reject negative values for the new replication-recovery env vars and
cap exit codes at 255 (see the validation sketch after this list).
- Convert the new ReplicationErrorRecovery{Deps,} interfaces to type
aliases to match the repo's TypeScript style.
- Tighten the reconnect dep comment to drop a stale "lastAcknowledgedLsn"
reference (the wrapper-tracked resume LSN is what callers actually pass).
- Restore process.exit after service.shutdown() in the exit-strategy
test so a delayed exit timer can't terminate the test worker.
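A sketch of the env validation described in the second bullet, assuming a hypothetical `parseNonNegativeInt` helper and the expanded name `RUN_REPLICATION_EXIT_CODE` (the PR itself only spells the suffix pattern `*_EXIT_CODE`):

```ts
// Hypothetical helper; the PR's actual env parsing may differ.
function parseNonNegativeInt(
  name: string,
  raw: string | undefined,
  fallback: number
): number {
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// POSIX truncates exit statuses to 8 bits, so cap at 255 rather than let
// a value like 256 wrap around to 0, which supervisors read as success.
const exitCode = Math.min(
  parseNonNegativeInt(
    "RUN_REPLICATION_EXIT_CODE", // assumed expansion of *_EXIT_CODE
    process.env.RUN_REPLICATION_EXIT_CODE,
    1
  ),
  255
);
```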
LogicalReplicationClient.subscribe() can resolve without throwing or emitting an "error" event when leader-lock acquisition fails — it just calls this.stop() and returns. The reconnect callback now checks isStopped after subscribe() and throws so the recovery handler can schedule the next attempt instead of silently giving up.
…rough handle()

The previous post-subscribe() isStopped check was always true on the happy path: subscribe() calls stop() up front (setting _isStopped=true) and only resets the flag inside the replicationStart event, which fires asynchronously after subscribe() returns. So the check threw on every successful reconnect, the catch rescheduled, the next attempt tore down the just-built client, and the cycle continued — replication briefly worked between teardowns, which is why the integration test passed.

Replace it with the correct nudge: subscribe to leaderElection and call the recovery handler on isLeader=false. That's the only subscribe() exit path that doesn't either throw or emit an "error" event (the other silent-return paths emit "error" first via createPublication/createSlot failures).
force-pushed from 964b7e4 to a2eaf3e
```ts
if (!isLeader) {
  // Failed leader election doesn't throw or emit an "error" event —
  // subscribe() just emits leaderElection(false), calls stop(), and
  // returns. Nudge the recovery handler so reconnect doesn't silently
  // stall when another instance holds the lock.
  this._errorRecovery.handle(
    new Error("Failed to acquire replication leader lock")
  );
}
```
🔴 leaderElection(false) triggers process exit in "exit" strategy, causing restart loops for non-leader instances
The new leaderElection event handler unconditionally calls this._errorRecovery.handle(...) when isLeader is false. For the "reconnect" strategy, this correctly schedules a retry with backoff. However, for the "exit" strategy, this calls scheduleExit(), which terminates the process.
In a multi-instance deployment, only one instance wins the leader election — losing is a normal operational scenario, not an error requiring process restart. With RUN_REPLICATION_ERROR_STRATEGY=exit, every non-leader instance will exit immediately after failing the initial leader election in start(), creating an infinite restart loop (exit → supervisor restart → fail election → exit → …).
How it happens step-by-step
1. `start()` calls `subscribe()` (runsReplicationService.server.ts:329)
2. `subscribe()` fails leader election → emits `leaderElection(false)` (client.ts:257)
3. The handler at line 289 calls `this._errorRecovery.handle(new Error(...))`
4. `handle()` dispatches to `scheduleExit()` (replicationErrorRecovery.server.ts:136-137)
5. `scheduleExit()` sets `exiting = true` and schedules `process.exit(code)` after a delay
6. Process terminates, supervisor restarts, cycle repeats
The handler's comment acknowledges it was designed for the reconnect strategy ("Nudge the recovery handler so reconnect doesn't silently stall"), but the dispatch is unconditional across all strategies.
Prompt for agents
The leaderElection(false) handler at runsReplicationService.server.ts:289-297 and the identical one at sessionsReplicationService.server.ts:270-275 unconditionally call this._errorRecovery.handle() for failed leader elections. This is only appropriate for the "reconnect" strategy. For the "exit" strategy, it causes the process to terminate whenever it fails to acquire the leader lock — including during the initial start() in a multi-instance deployment, causing a restart loop.
Possible fixes:
1. Guard the handler so it only fires for the reconnect strategy — e.g. check options.errorRecovery?.type === "reconnect" before calling handle(). This keeps the exit and log strategies unaffected by leader election contention (see the sketch after this list).
2. Alternatively, introduce a separate method on ReplicationErrorRecovery (e.g. handleLeaderElectionLoss()) that only the reconnect strategy acts on, leaving exit and log as no-ops.
3. Apply the same fix to both RunsReplicationService and SessionsReplicationService.
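A minimal sketch of option 1, using the exact check the prompt suggests (the surrounding handler shape is taken from the diff above; the `options.errorRecovery` path is the prompt's assumption, not verified against the services):

```ts
if (!isLeader) {
  // Losing the leader election is normal when another instance holds the
  // lock. Only the reconnect strategy should treat it as recoverable;
  // exit and log must stay no-ops so non-leaders don't restart-loop.
  if (this.options.errorRecovery?.type === "reconnect") {
    this._errorRecovery.handle(
      new Error("Failed to acquire replication leader lock")
    );
  }
}
```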
Summary
When the logical-replication stream errored (most commonly after a Postgres failover), the runs and sessions replication services logged the error and left the underlying client stopped. The host process kept running, the WAL backed up, and ClickHouse silently fell behind.
Fix
Both services now run a configurable recovery strategy on stream errors, defaulting to in-process reconnect with exponential backoff so a fresh self-hosted setup heals on its own.
- `reconnect` (default) — re-subscribe with exponential backoff (1s → 60s cap, unlimited attempts). `LogicalReplicationClient.subscribe(lastLsn)` re-validates the publication, re-acquires the leader lock, and resumes from the last acknowledged LSN.
- `exit` — `process.exit(1)` after a short flush window so a host supervisor (Docker `restart=always`, systemd, k8s) can replace the process.
- `log` — preserves the old behaviour.

Per-service strategy + exit knobs are env-driven (`RUN_REPLICATION_ERROR_STRATEGY` / `SESSION_REPLICATION_ERROR_STRATEGY` + `*_EXIT_DELAY_MS`, `*_EXIT_CODE`). Reconnect tuning is shared across both services (`REPLICATION_RECONNECT_INITIAL_DELAY_MS`, `_MAX_DELAY_MS`, `_MAX_ATTEMPTS`; `MAX_ATTEMPTS=0` means unlimited).
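A sketch of the strategy shape this implies: a discriminated union dispatched in a single handler (illustrative names and fields; the PR's real definitions are the `ReplicationErrorRecovery` type aliases mentioned in the review-feedback commit):

```ts
// Illustrative only; field names are assumptions, not the PR's API.
type ReplicationErrorStrategy =
  | { type: "reconnect"; initialDelayMs: number; maxDelayMs: number; maxAttempts: number }
  | { type: "exit"; delayMs: number; code: number }
  | { type: "log" };

// Assumed to be idempotent via the pendingReconnect guard described above.
declare function scheduleReconnect(err: Error): void;

function handle(strategy: ReplicationErrorStrategy, err: Error): void {
  switch (strategy.type) {
    case "reconnect":
      scheduleReconnect(err);
      break;
    case "exit":
      // Short flush window before exiting so a supervisor can restart us.
      setTimeout(() => process.exit(strategy.code), strategy.delayMs);
      break;
    case "log":
      // Historical behaviour: log the error and leave the stream stopped.
      console.error("Replication stream error", err);
      break;
  }
}
```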
Test plan

Integration tests cover all three strategies by simulating a failover with `pg_terminate_backend` against the WAL sender:

- `reconnect` — kill the backend, insert a new row, assert it lands in ClickHouse
- `exit` — kill the backend, assert `process.exit(1)` is called
- `log` — kill the backend, insert a new row, assert it does not land in ClickHouse

`pnpm --filter webapp test --run runsReplicationService.errorRecovery`
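For reference, a failover like the one these tests simulate can be forced with Postgres's real `pg_terminate_backend` against the slot's WAL sender; the helper and slot-name parameter below are assumptions for illustration, not the actual test code:

```ts
import { Client } from "pg";

// Hypothetical helper: kill the WAL sender serving a replication slot so
// the logical-replication client sees the stream drop, as in a failover.
async function killWalSender(pg: Client, slotName: string): Promise<void> {
  await pg.query(
    `SELECT pg_terminate_backend(active_pid)
       FROM pg_replication_slots
      WHERE slot_name = $1 AND active_pid IS NOT NULL`,
    [slotName]
  );
}
```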